NATURAL LANGUAGE INPUT METHOD AND APPARATUS

The present invention generally relates to a natural 
language input method and apparatus to allow for computer 
usable data to be input by recognising a natural language 
input which includes pauses. 

When inputting data in a natural language, a user 
can insert pauses in the natural language input which can 
adversely affect the recognition of the natural language 
input.

In particular, in speech recognition systems which use
context free grammars, if a user inserts pauses other than
at the places expected by a speech recognition engine 
e.g. at the end of a sentence, the resultant speech 
recognition accuracy can be adversely affected. 

There are many reasons why a speaker may insert 
pauses during speech input e.g. when emphasising words 
where the pauses are not properly interpreted by the 
speech recogniser. Pauses may also occur in the speech 
input where actions are involved. One particular area in 
which this occurs is in the field of multimodal data 
input. 

In order to increase the richness with which a user 
can interact with a machine, it has become common for the 
user to be able to interact with the machine using more than
one type of input device, i.e. more than one modality.
For example, it is common in speech recognition systems
used on general purpose computers to allow a user to 
input data using a speech recognition engine, and to 
supplement the input of speech data with mouse data and 
keyboard data. Multimodal systems combine input 
modalities such as touch, pen, speech and gesture to 
allow more natural and powerful communication than any 
single modality would alone. 

When one of the modalities comprises a channel by 
which natural language can be input, in view of the 
interaction by a user with more than one modality at the 
same time, the inputting of data using a second modality 
can affect the inputting of data using natural language 
i.e. when a user is inputting data in a second modality, 
this can cause a delay in the input of natural language. 
For example, when a user uses a multimodal system for 
inputting speech and mouse events, the user may pause
during speech in order to properly locate the pointer 
controlled by the mouse in order to generate the mouse 
event. This pause in the natural language input can in 
some instances cause errors in the recognition of the 
natural language input. The reason for this is that some 
speech recognition systems use context free grammars for 
the recognition process. A context free grammar defines 
a whole utterance (i.e. a portion of speech between
pauses). Thus a pause appearing in the middle of what
the recognition engine expects to be an utterance causes
the recognition engine to treat the input speech as two
shorter utterances. The recognition engine will thus try 
to match the two utterances separately to the grammar 
rules. This causes misrecognition. 
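
By way of illustration only, the effect can be sketched as follows. The sketch assumes a grammar holding the single whole-utterance phrase "fax this to him" and a recogniser that matches each pause-delimited utterance against the grammar independently; the names GRAMMAR and recognise are hypothetical and do not correspond to any particular speech recognition engine.

```python
# Hypothetical whole-utterance grammar: each entry must match a complete utterance.
GRAMMAR = {"fax this to him"}


def recognise(utterances):
    """Match each pause-delimited utterance against the grammar separately.

    Returns the phrase where it matches a grammar rule, otherwise None
    (standing in for a misrecognition).
    """
    return [u if u in GRAMMAR else None for u in utterances]


# Spoken without a pause: one utterance, which matches the grammar.
no_pause = recognise(["fax this to him"])

# Spoken with a pause after "this": two shorter utterances, neither of which matches.
with_pause = recognise(["fax this", "to him"])
```

In this simplified model the same four words are recognised when spoken as one utterance but fail when an unexpected pause divides them into two.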

It is therefore an object of the present invention 
to overcome this limitation in the prior art. 

In accordance with the first aspect of the present
invention, there is provided a data processing apparatus
for generating a modified data structure which defines 
modified grammar rules for recognition of a natural 
language input with pauses. A data structure defining 
grammar rules for recognition of a natural language is 
received and analysed to identify positions in the 
grammar rules at which pauses can occur in the natural 
language input. This is then used to generate the 
modified data structure. 

In accordance with the present invention, there are 
many different pause criteria which can be used for the 
identification of pauses in the natural language input. 
The criteria can take into account the behaviour of an 
individual user, or whether or not other inputs are used. 

The modified data structure can be generated simply 
by adding a form of marker or tag to the data structure
to identify positions in the grammar rules at which
pauses can occur in the natural language input.
Alternatively or in addition, the grammar rules can be
fragmented in accordance with the identified positions to
generate sub grammar rules. The sub grammar rules can be
arranged hierarchically to form the modified data
structure.

The modified data structure, because it contains 
information to allow for pauses, can then be used for the
recognition of the natural language input which includes 
pauses to thereby provide for more accurate recognition. 

In the present invention the natural language input 
can comprise any form of natural language for 
communicating between people. This not only includes the 
conventional natural languages e.g. English, French, 
German etc., but also includes other natural languages 
such as sign language for which recognition can depend 
upon the temporal relationships of natural language units 
e.g. words and grammar rules are applied for recognition. 

The analysis performs a prediction to identify where 
pauses may be inserted in the natural language input. 
This enables the recognition of the natural language 
either with or without the pauses i.e. it allows a user 
to be relaxed about pausing during the input of natural 
language.



The present invention is particularly suited for use 
with speech recognition as the first modality input. 
Certain speech recognition processes use grammar rules 
e.g. context free grammar rules for the recognition 
process. Unexpected pauses within the input speech can 
cause a reduction of recognition accuracy and thus the 
present invention can be used for generating speech 
recognition modified grammars which take into account 
pauses within the speech.

The present invention is also particularly suited to 
multimodal input systems in which the first modality is 
a natural language input and a second modality comprises 
associated events e.g. mouse clicks or gestures. In 
order to recognise such multimodal input, the multimodal 
grammar in an embodiment of the invention defines 
multimodal grammar rules by defining grammar rules for 
the recognition of a natural language in conjunction with 
associated events in one or more further modalities. In 
such a system, events in a further modality can affect 
the timing of the input natural language and thus the 
analysis to identify where pauses can occur in the 
natural language can be achieved based on events in 
further modalities. 

In order to enable data to be input into a 
multimodal system, a modified multimodal data structure
defining modified multimodal grammar rules is preferably
generated in addition to the modified data structure 
defining modified grammar. In the modified multimodal 
grammar rules the relationships between events in the or 
each further modality is defined in relation to the 
modified grammar rules. 

The present invention also provides an apparatus and 
method for generating data in a computer usable form 
using the data structure. The modified data structure is
used in conjunction with a natural language input for the 
recognition of the natural language input. An example of 
such a system is a speech recognition engine which 
utilises the modified grammar rules in order to perform 
a speech recognition process. 

The present invention also provides an apparatus and 
method for generating data in a computer usable form from 
a multimodal input. Recognised natural language data is 
input together with events for one or more further 
modalities. Also the multimodal modified data structure 
is used which defines the relationship between the 
modified grammar rules and the events in the or each 
further modality. An analysis is carried out to 
determine if the first modality input data and the or 
each further modality input data match any modified 
grammar rule and comply with any related events in the or
each further modality. If a match is found, computer
usable data can be generated in dependence upon the 
match. 

The present invention can be embodied as a specific 
hardware system, or as software implemented by a 
processing system. Thus the present invention can be 
embodied as processor implementable instructions for 
controlling a processor e.g. a computer program. Such 
instructions can be provided in physical form to a 
processing system on a carrier medium e.g. floppy disk, 
CDROM, magnetic tape, any other programmable medium, or 
any form of carrier signal such as a signal carried over 
a computer network such as the Internet. 

Embodiments of the present invention will now be 
described with reference to the accompanying drawings, in 
which: 

Figure 1 is a schematic diagram of a generalised 
embodiment of the present invention; 

Figure 2 is a schematic diagram of a second
embodiment of the present invention;

Figure 3 is a schematic diagram of a general purpose
computer for implementing the second embodiment of the
present invention;

Figure 4 is a diagram illustrating a multimodal
input; 

Figure 5 is a flow diagram illustrating the method 
of generating the modified data structure in accordance 
with a second embodiment of the present invention; 

Figure 6 is a diagram illustrating the marking of 
pauses within the input data structure; 

Figures 7a and 7b are diagrams illustrating the
generation of the sub grammar rules;

Figures 8a and 8b illustrate the generation of the
modified data structure defining the modified grammar
rules;

Figure 9 is a flow diagram illustrating the 
generation of the modified multimodal data structure; 

Figure 10 is a diagram illustrating the input 
multimodal data structure; 

Figure 11 is a diagram illustrating the relationship
between the sub grammar rules and the second modality
events;

Figure 12 is a diagram illustrating the modified
multimodal data structure defining the modified
multimodal grammar rules;

Figures 13a and 13b are a flow diagram illustrating
the generation of computer usable data from a multimodal
input;



Figure 14 is a schematic diagram of a third 
embodiment of the present invention; and 

Figure 15 is a flow diagram illustrating the 
operation of the third embodiment of the present 
invention in generating computer usable data in the form 
of units in the natural language i.e. words. 

The generalised embodiment of the present 
invention will now be described with reference to
Figure 1.

A processing unit 3 in a processing system is 
arranged to respond to a predetermined selection of 
multimodal inputs. The multimodal inputs are defined by 
rules forming a multimodal data structure. The rules 
employed for input to the processing unit are multimodal 
and are defined in terms of predetermined sequences of 
words in combination with associated second modality 
events. 

An input processor 2 is provided to receive the 
multimodal inputs by the user. The input processor 2
attempts to match or fit the multimodal inputs to a rule 
to which the processing unit responds. If the multimodal 
input successfully matches a multimodal rule, the 
processing unit responds to the input in a manner 
dependent upon which rules it satisfies. 

The multimodal input may comprise an instruction, a
response to a prompt, a message, or a question. Thus the 
processing unit can respond to multimodal input which can 
comprise input data for processing by the processing unit 
and/or input instructions for causing the processing unit 
to perform a function. 

In the illustrated embodiment, a user employs two 
separate modalities for input into the processing system, 
one of which is speech. The first and second modalities
are interrelated such that events in the second modality 
depend upon events in the first modality. 

Since one of the input modalities is speech, the
input processor 2 makes use of a conventional speech 
recognition (SR) engine which analyses the input speech 
signal to provide a result comprising the words it has 
recognised. In order to perform the speech recognition, 
the speech recognition engine utilises grammar rules in 
the form of a data structure in order to perform the speech
recognition process. The grammar rules define a grammar 
e.g. context free grammar of the natural language 
employed by a speaker providing the speech input which 
corresponds to the words within the multimodal rules 
employed for input to the processing unit. 

Since, in this embodiment, the use of the second 
modality input causes the user to punctuate the speech
input with pauses, the use of conventional grammar rules
for the speech recognition engine can reduce the
performance of the speech recognition engine. The use of
the second modality input in conjunction with speech
input can cause a user to vary his delivery of the words
so that the user's natural speech flow can be affected.
Thus grammar rules extracted simply by extracting the
words from within the multimodal rule employed for input
to the processing unit will not take into account the
effect of the second modality. Thus in the present 
embodiment the SR grammar rules to be used by the SR 
engine are obtained by a modification of the SR grammar 
rules within the multimodal data structure. 

A data structure preprocessor 1 is provided to 
receive the multimodal data structure and generate a 
modified data structure defining modified grammar rules 
for the speech recognition engine and a modified 
multimodal data structure for use in the interpretation 
of the multimodal input. 

In order to generate the modified data structure 
defining the modified grammar rules, the data structure 
preprocessor 1 analyses the multimodal rules defined by 
the multimodal data structure to determine the positions 
between spoken words where the speaker is expected to 
pause due to carrying out an action related to his need
to input using the second modality. The data structure
preprocessor 1 fragments each grammar rule in the 
multimodal rules on the basis of the pauses to form 
grammar sub rules. The content of the grammar sub rules 
plus their mutual relationships are used by the data 
structure preprocessor 1 to form the modified grammar 
rules defined by the modified data structure. In this 
way the modified grammar rules defined by the modified 
data structure accommodate the pauses influenced by the
second modality input. 

Thus the SR engine utilises the modified grammar 
rules provided by the modified data structure in order to 
generate a recognition result comprising a string of 
recognised words. The multimodal rules for input to the 
processing unit comprise a combination of such words and 
associated second modality events. The modified
multimodal data structure is generated by the data
structure preprocessor 1 in order to provide an improved 
way of identifying appropriate outputs from the SR engine 
in the light of the combination with recorded second 
modality events. Hence more accurate identification of 
a multimodal grammar rule can be achieved. Within the 
data structure preprocessor 1, the modified multimodal 
data structure is formed by integrating second modality 
events at the newly formed grammar sub rule level of the 
modified data structure. Thus the data structure
preprocessor determines what multimodal events should be
associated with each grammar sub rule within a modified
grammar rule. The data structure preprocessor 1 forms
the modified multimodal data structure from the
determined associations.

The output of the speech recognition engine and the 
input second modality events are compared within the 
input processor 2 with the association specified in the
modified multimodal data structure to determine which 
multimodal rule is matched by the multimodal input. If 
a multimodal rule is matched by a multimodal input, an 
appropriate input is generated to the processing unit 3 
in accordance with the multimodal rule matched. 

In this embodiment, the processing system can 
comprise any processing system which requires a 
multimodal interface to implement a process within a
processing unit 3. The functional units can be
implemented either partly or wholly in software either on
a special purpose machine or a general purpose computer. 

A more detailed embodiment of the present invention 
will now be described with reference to Figures 2 to 13. 
Figure 2 is a functional diagram illustrating a 
computer system allowing speech input as a first modality
and mouse events as a second modality. The input is used
to input data and/or instructions to a processing 
application 30. 

A database of multimodal data structures 40 is
provided to store a plurality of multimodal rules for 
defining inputs to the processing application 30. Also 
a database of pause criteria 50 is provided to store 
criteria identifying how the second modality events i.e.
the mouse clicks can affect the timing of the speech
pattern by a user.

This embodiment of the present invention is provided
with a data structure preprocessor 10 and an input 
processor 20 which operate in a similar manner to the 
data structure preprocessor 1 and the input processor 2 
of the previously described embodiment. 

The data structure preprocessor 10 includes a pause 
processor 100 which receives multimodal rules defined by 
multimodal data structures from the database of
multimodal (MM) data structures 40. The pause processor 
100 processes the multimodal rule defined by the 
multimodal data structure in accordance with the pause 
criteria read from the database of pause criteria 50. 
The pause processor 100 inputs markers into the grammar 
rules to identify the position of pauses. A modified 
data structure former 101 receives the grammar rules with
markers and fragments the marked grammar rules using the
markers in order to form sub-grammar rules. The
sub-grammar rules are then hierarchically arranged in
dependence upon their mutual relationships in order to
form modified grammar rules defined by a modified data
structure.

A modified data structure store 102 is provided to 
store the modified data structure. This can then be made 
available to a speech recogniser as will be described
in more detail hereinafter.

The data structure preprocessor 10 also includes a 
modified multimodal data structure former 103 which
receives the input multimodal data structures and reads 
the modified data structure store 102. The modified 
multimodal data structure former 103 determines how the 
mouse events should be associated with each grammar sub- 
rule within the modified grammar rule of the modified 
data structure. The modified multimodal data structure 
former 103 forms a modified multimodal data structure in
accordance with the determined associations. A modified 
multimodal data structure store 104 is provided within 
the data structure preprocessor 10 for storing the formed 
modified multimodal data structure. 

Thus the data structure preprocessor 10 generates 
modified data structures and modified multimodal data
structures as described hereinabove with reference to the
previous embodiment. These are used by the input 
processor 20 in order to derive an input for the 
processing application 30. 

This embodiment is provided with a speech input 
device 60 in order to generate a speech signal which is
input to a speech recogniser 200. The speech recogniser 
200 carries out recognition using the modified data 
structure read from the modified data structure store 102
provided in the data structure preprocessor 10. The 
output of the speech recogniser 200 comprises a sequence 
of recognised words which are input to a comparator 201. 
Also input into the comparator 201 are mouse events 
generated by a mouse 70. The comparator compares the
multimodal input with the modified multimodal data
structures read from the modified multimodal data
structure store 104 in the data structure preprocessor 10.
In dependence upon the matching of the input multimodal
data with the modified multimodal data structure, an input
is generated for the processing application 30.

This embodiment of the present invention is 
implemented using a general purpose computer and the 
functional units comprise software modules implemented by 
a processor. Figure 3 illustrates the structure of the 
general purpose computer in more detail. 



The computer comprises a processor 59 for 
implementing program code stored in the program storage 
51. When the processor 59 implements the program code, 
the data which is stored in the data storage 52 is 
processed. The computer is also provided with
conventional random access memory (RAM) 53 for use as
working memory by the processor 59. A keyboard 54 is also
provided for conventional keyboard input. A display 57
is provided for providing the user with a visual output.
An audio input device 56 is provided to enable a user to 
input speech as the first modality input. A mouse 57 is 
provided as the second modality input device. The 
components of the computer are linked by a control and 
data bus 58. The processor implements a pause processor 
59a by implementing pause processor code read from the 
program storage 51. The processor also implements a 
modified data structure former 59b by implementing the 
modified data structure former code provided in the 
program storage 51. The processor further implements a 
modified multimodal data structure former 59c by 
implementing the modified multimodal data structure 
former code stored in the program storage 51. The 
processor also implements a speech recognition engine 59d 
by implementing the speech recognition engine code stored 
in the program storage 51. Further, the processor 59 
implements a comparator 59e by implementing the
comparator code stored in the program storage 51. Also, 
the processor 59 implements a processing application 59f 
by implementing the processing application code stored in 
the program storage 51. 

The data storage and program storage can comprise
any suitable storage device such as non volatile memory
e.g. floppy disk, hard disk, programmable read only
memory devices, or optical disks, or volatile memory e.g.
RAM.

It can thus be seen from this embodiment of the 
present invention that the present invention can be 
implemented by supplying computer code to a general 
purpose computer to implement the functions. A computer
program can be supplied by providing the computer program 
on any carrier medium such as a storage medium e.g. 
floppy disk, optical disk, magnetic tape etc. or as a 
signal e.g. a signal transmitted over a network such as 
the Internet. 

The method of operation of the data structure
preprocessor 10 will now be described in more detail with
reference to Figures 4 to 11. 

This embodiment of the present invention will be
described with reference to use with a facsimile receipt
and transmission processor application.



The multimodal rules are framed in a format which is 
an extension of the Java Speech Grammar Format (JSGF). 
The JSGF accompanies the Java speech API (Application 
Program Interface) as a platform independent method for 
Java programmers to use conventional speech recognition 
engines in Java programs. Version 1.0 of the JSGF 
was released by Sun on 26 October 1998. Under JSGF each 
rule is specified by naming it inside angular brackets
(< >) followed by an equal sign (=) and a rule
definition. The rule definition is in terms of tokens,
where a token is a word which can be spoken or a sequence
of words with a single combined meaning e.g. "New York 
City". The JSGF is extended in the present embodiment to 
accommodate the second modality i.e. mouse click events. 
The mouse click events are treated as tokens and the 
click is considered to be the information content of the 
token. In order for a processor to recognise which
modality this token comes from, the token consisting of
"click" is preceded with an exclamation mark (!) which
itself is preceded by the modality i.e. "mouse", giving
overall "mouse!click". If two separate mouse channels
were employed, then the mouse modalities could be
separately identified as "mouse1!click" and
"mouse2!click" respectively. When no modality is
specified before a token, that token is considered to
comprise the speech modality (first modality).
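
By way of illustration only, the token convention described above might be interpreted by a helper of the following kind. The function name parse_token and the use of the label "speech" for the default modality are assumptions made for this sketch and are not part of the JSGF.

```python
def parse_token(token):
    """Split an extended-format token into a (modality, content) pair.

    A token of the form "modality!content" belongs to the named modality.
    A token with no "!" is taken to belong to the speech modality.
    """
    if "!" in token:
        modality, content = token.split("!", 1)
        return modality.strip(), content.strip()
    return "speech", token.strip()


mouse_event = parse_token("mouse!click")
second_mouse_event = parse_token("mouse2!click")
spoken_word = parse_token("fax")
```

Here mouse_event is the pair ("mouse", "click"), while a plain word such as "fax" is assigned to the speech modality.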

In order to specify how the spoken words are
coordinated with mouse click events, the ampersand
symbol (&) is used to indicate that a mouse click
event is associated with a particular word or token. By
way of an example, in the present embodiment one
multimodal rule of the application multimodal data
structure is defined as follows:

<fax rule> =
    fax (this & mouse!click) to (him & mouse!click)

Accordingly, for this rule to be satisfied, the word 
"fax" must be received via the speech modality, then the 
word "this" must be received by the speech modality in 
association with a mouse click event via the second 
modality, then the word "to" must be received via the 
speech modality and finally the word "him" must be 
received by the speech modality in association with a 
mouse click event via the second modality. 
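
By way of illustration only, a rule definition of this kind can be broken down into an ordered list of words together with their associated second modality events. The following is a minimal sketch; the function name parse_rule and the list-of-pairs representation are assumptions made for the example, and only the simple bracketed form used in the fax rule is handled.

```python
import re


def parse_rule(definition):
    """Parse a rule definition into a list of (word, events) pairs.

    A bracketed group "(word & modality!event)" yields the word together
    with its associated events; a bare word yields an empty event list.
    """
    items = []
    # Each match is either a bracketed group or a bare token.
    for group, bare in re.findall(r"\(([^)]*)\)|(\S+)", definition):
        if group:
            word, *events = [part.strip() for part in group.split("&")]
            items.append((word, events))
        else:
            items.append((bare, []))
    return items


fax_rule = parse_rule("fax (this & mouse!click) to (him & mouse!click)")
```

The resulting list preserves the spoken word order, with "this" and "him" each carrying one associated mouse click event and "fax" and "to" carrying none.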

In the present embodiment, a mouse click event is
defined as being associated with a given word if it
occurs at any time after the end of the word preceding
the given word and before the start of the word following
the given word. This is shown schematically in Figure 4.
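
By way of illustration only, the association window can be expressed as a simple timing test. The sketch below assumes that each recognised word carries a start time and an end time in seconds, that the window is open at both boundaries, and that it is unbounded before the first word and after the last word; the function name is_associated and the example timings are hypothetical.

```python
def is_associated(words, index, click_time):
    """Report whether a click is associated with the word at `index`.

    `words` is a list of (text, start, end) tuples in spoken order. The
    click is associated with the word if it occurs after the end of the
    preceding word and before the start of the following word.
    """
    prev_end = words[index - 1][2] if index > 0 else float("-inf")
    next_start = words[index + 1][1] if index + 1 < len(words) else float("inf")
    return prev_end < click_time < next_start


# Hypothetical timings for "fax this to him", with a pause after "this".
words = [("fax", 0.0, 0.3), ("this", 0.5, 0.8), ("to", 1.4, 1.5), ("him", 1.7, 2.0)]

click_in_pause = is_associated(words, 1, 1.0)   # during the pause after "this"
click_too_late = is_associated(words, 1, 1.45)  # while "to" is being spoken
```

Under this definition the windows of adjacent words overlap, so a click falling in the gap between two words satisfies the test for both of them.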

The pause criteria used in this embodiment and 
stored in the database of pause criteria 50 can comprise 
a general set of rules which are applicable for 
generating inputs for any type of processing application. 
Alternatively, they can be adapted for use with a 
particular application and corresponding multimodal data 
structures. In the present example the pause criteria
are relatively simply defined and hence can be applied to
a range of applications. The pause criteria consist of
just one basic rule that is applied uniformly to each
rule of the multimodal data structure. The basic pause
rule is that a single pause is possible in relation to
one or more words of the multimodal rule, provided one or
more mouse clicks are associated in accordance with the
above described bounded relation to any such words. The
single pause is accommodated either directly before or 
directly after the corresponding words, but there is no 
possibility accepted of a pause occurring both directly 
before and directly after the words. 

Thus in accordance with this specific 
implementation of the present invention, the data 
structure preprocessor 10 implements the pause processor 
100, the modified data structure former 101, and the
modified multimodal data structure former 103 as Java code
modules. This enables the speech recogniser 200 to 
interface to the grammar rules stored in the modified 
data structure store 102 using JSGF. 

The method of implementation of the pause processor 
100 and the modified data structure former 101 will now 
be described with reference to Figures 5 to 8. 

In step S1 the multimodal data structure defining MM
grammar rules is received, which comprises speech
recognition grammar rules and associated mouse events.
The multimodal data structure is read from the database 
of multimodal data structures 40. 

In step S2 pause criteria are received from the 
database of pause criteria 50. In step S3 the pause 
criteria are applied to the multimodal grammar rule of 
the multimodal data structure to establish positions of 
any pauses of the words of the multimodal grammar rules. 
Assuming that the multimodal grammar rule comprises the 
fax rule mentioned hereinabove and as illustrated in 
Figure 4, the pause processor 100 analyses the words of
the fax rule to locate any words with which multimodal
events are associated. The word "this" is identified and
also the mouse click associated therewith in the
multimodal rule. Thus the pause processor 100
establishes that, in addition to no pause occurring in
relation to the word "this", a pause may be located
directly before the word "this" or directly after the
word "this", giving three separate possibilities. The
word "him" is also identified, along with the associated
mouse click event. Thus the pause processor 100
establishes that, in addition to no pause occurring in
relation to the word "him", a pause may be located
directly before the word "him" or directly after the
word "him", again giving three separate possibilities.
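
By way of illustration only, step S3 can be sketched as a function which applies the basic pause rule to a parsed multimodal rule. The function name pause_options, the list-of-pairs representation of the rule and the labels "none", "before" and "after" are assumptions made for this example.

```python
def pause_options(rule):
    """Map the index of each word having associated events to its pause options.

    `rule` is a list of (word, events) pairs in spoken order. Under the
    basic pause rule such a word may have no pause, a pause directly
    before it, or a pause directly after it, but never both.
    """
    return {
        index: ("none", "before", "after")
        for index, (_word, events) in enumerate(rule)
        if events
    }


# The fax rule, written out as (word, events) pairs.
fax_rule = [
    ("fax", []),
    ("this", ["mouse!click"]),
    ("to", []),
    ("him", ["mouse!click"]),
]
options = pause_options(fax_rule)
```

For the fax rule only the words at positions 1 ("this") and 3 ("him") receive pause options, each with three possibilities.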

In step S4 the pause processor 100 extracts the 
complete chain of words contained in the multimodal rule 
to form a data structure which is equivalent to a 
conventional grammar rule usable by an SR engine. In step 
S5 the pause processor 100 then marks the pauses between 
the words at the established positions in the data 
structure. Three such possibilities were established due 
to the word "this" and three such possibilities were 
established due to the word "him". Since each
possibility for "this" may arise with each possibility
for "him", a combination of nine possible arrangements of
pause marker positions relative to the word order of the
rule arises. Such arrangements are hereinafter referred
to as marker configurations, and the nine versions in the
present example are shown as items 141 to 149 of Figure
6, in which the marked pauses are identified by the
nomenclature <silence>. Thus in this way a data
structure is established which comprises a plurality of 
strings of words with pause markers. This is input into 
the modified data structure former 101. 
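
By way of illustration only, steps S4 and S5 can be sketched as an enumeration of every combination of pause positions. The function name marker_configurations and the list-of-pairs representation of the rule are assumptions made for this example, and the order in which the configurations are generated is merely a choice made for the sketch.

```python
from itertools import product

SILENCE = "<silence>"


def marker_configurations(rule):
    """Enumerate every arrangement of pause markers for a multimodal rule.

    `rule` is a list of (word, events) pairs. Each word with associated
    events independently takes no pause, a pause directly after it, or a
    pause directly before it.
    """
    words = [word for word, _events in rule]
    clicked = [i for i, (_word, events) in enumerate(rule) if events]
    configurations = []
    for choice in product(("none", "after", "before"), repeat=len(clicked)):
        placement = dict(zip(clicked, choice))
        marked = []
        for i, word in enumerate(words):
            if placement.get(i) == "before":
                marked.append(SILENCE)
            marked.append(word)
            if placement.get(i) == "after":
                marked.append(SILENCE)
        configurations.append(marked)
    return configurations


fax_rule = [
    ("fax", []),
    ("this", ["mouse!click"]),
    ("to", []),
    ("him", ["mouse!click"]),
]
configs = marker_configurations(fax_rule)
```

With two words each having three possibilities, the sketch yields the nine strings of words with pause markers, beginning with the unmarked string "fax this to him".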

In step S6 the modified data structure former 101 
generates fragmented speech recogniser grammar rules 
according to the marker positions for any pauses to form 
speech recogniser grammar sub rules. For the fax rule 
each of the marker configurations 141 to 149 is
processed. Fragmentation takes place at the pause marker 
positions. Figure 7a shows the different speech 
recogniser grammar sub rules formed from respective 
marker configurations 141 to 149. Marker configuration 
141 contains no pause markers, hence no fragmentation 
occurs, hence the resulting SR grammar sub-rule is merely 
the same as the initial rule, i.e. <fax this to him>. 
Marker configuration 142 has a pause marker at the end of 
the phrase, hence fragmentation in this case again leads 
to a SR grammar sub-rule the same as the original rule, 
namely <fax this to him>. In marker configuration 143, 
a pause is marker between the words 'to 1 and 'him 1 . In 
this case, since new fragments must be formed in respect 
of this marked pause, two fragments are formed. The 
first fragment consists of the words 'fax this to', and 
the second fragment consists of the word 'him', which 



fragments form SR grammar the sub-rules <fax this to> and 
<him>. In marker configuration 144 , a pause is marked 
between the words 'this 1 and 'to'. Consequently, two 
further new SR grammar sub-rules are formed from the 
fragments either side of the pause marker, namely <fax 
this> and <to him>. Marker configuration 145 is similar 
to marker configuration 144, but has a further pause 
marked at the end of the initial grammar SR rule, after 
th e word 1 him' > This in fact produces no extra f ragments 
compared to marker configuration 144, consequently the 
two SR grammar sub-rules produced by fragmenting marker 
configuration 145 are in fact the same as for 144. In 
marker configuration 146, there is one pause marked 
between the words 'this' and 'to', and a further pause
marked between the words 'to' and 'him'. The fragments
that result provide the SR grammar sub-rules of <fax 
this>, <to> and <him>. Note that the SR grammar sub-rule 
<fax this> was also produced from marker configurations 
144 and 145, and the SR grammar sub-rule <him> was also
produced from marker configuration 143, but the SR
grammar sub-rule <to> is a new SR grammar sub-rule which
was not produced by any of marker configurations 141 to
145. Marker configuration 147 has just one pause which
is marked between the words 'fax' and 'this'.
Fragmentation here results in two new SR grammar
sub-rules, namely <fax> and <this to him>. Marker
configuration 148 is similar to 147 but has an additional
pause marked after the word 'him'. This however produces
the same fragments as described for marker configuration
147. Finally marker configuration 149 includes a pause
marked between the words 'fax' and 'this', and a pause
marked between the words 'to' and 'him'. This provides
three fragments providing three SR grammar sub-rules
which are <fax>, <this to> and <him>. Of these three SR
grammar sub-rules both <fax> and <him> are replications 
of SR grammar sub-rules produced by previous marker 
configurations 141 to 148, whereas <this to> is a further 
new SR grammar sub-rule. 

The above described process produces a number of SR 
grammar sub-rules. In the present case nine such 
different SR grammar sub-rules have been produced, and 
these are shown in Figure 7b. It is important to note 
that the above described procedure does not merely 
represent each combination of dividing the four words 
contained in the phrase 'fax this to him'. That process 
would instead have produced a further possibility of the 
word 'this', which does not appear in the SR grammar
sub-rules shown in Figure 7b. It is to be appreciated
that more complicated standard SR grammar rules will
typically produce a significantly smaller number of SR
grammar sub-rules relative to the total number of
permutations of words.

The above processes are repeated for each multimodal 
rule used for generating an input for the processing 
application 30. 

At step S7, the modified data structure former 101 
forms a modified SR grammar rule defined by a modified 
data structure from the SR grammar sub-rules. This 
process comprises implementing the logical relationship
between the different SR grammar sub-rules, i.e. in the 
case of the present fax rule, the relative sequences as 
indicated on the right hand side of Figure 7a. This is 
further illustrated for the case of the present fax rule 
in Figures 8a and 8b, where Figure 8a repeats the content
of the right hand side of Figure 7a except that identical 
outcomes from different marker configurations are not 
duplicated, and secondly the sub-rules are presented in 
their labelled form, e.g. <SR sub-rule 1> rather than 
<fax this to him>. A vertical line represents the "or" 
symbol. Figure 8b represents the content of Figure 8a 
except that further use is made of the "or" symbol 
consisting of a vertical line. The above described 
process is repeated for each multimodal rule for 
generating an input for the processing application to
form respective modified SR grammar rules defined by the
modified data structure. 
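
The formation of the modified SR grammar rule at step S7 can be sketched as follows. This is a minimal illustration, assuming that Figures 8a and 8b have the form described in the text; the names SUB_RULE_IDS, FRAGMENT_SEQUENCES, labelled_alternatives and group_by_start are hypothetical and are used only for this sketch.

```python
# Sub-rule labels as numbered in the description (SR sub-rule 1 to 9).
SUB_RULE_IDS = {
    "fax this to him": 1, "fax this to": 2, "him": 3,
    "fax this": 4, "to him": 5, "to": 6,
    "fax": 7, "this to him": 8, "this to": 9,
}

# Fragment sequences resulting from marker configurations 141 to 149.
FRAGMENT_SEQUENCES = [
    ["fax this to him"],          # 141
    ["fax this to him"],          # 142
    ["fax this to", "him"],       # 143
    ["fax this", "to him"],       # 144
    ["fax this", "to him"],       # 145
    ["fax this", "to", "him"],    # 146
    ["fax", "this to him"],       # 147
    ["fax", "this to him"],       # 148
    ["fax", "this to", "him"],    # 149
]


def labelled_alternatives(sequences):
    """Replace fragments by labels and drop duplicate sequences."""
    result = []
    for seq in sequences:
        labels = tuple(SUB_RULE_IDS[frag] for frag in seq)
        if labels not in result:
            result.append(labels)
    return result


def group_by_start(alternatives):
    """Group the alternatives by their first sub-rule."""
    grouped = {}
    for first, *rest in alternatives:
        grouped.setdefault(first, []).append(tuple(rest))
    return grouped


ALTERNATIVES = labelled_alternatives(FRAGMENT_SEQUENCES)
GROUPED = group_by_start(ALTERNATIVES)
```

In this sketch ALTERNATIVES corresponds to the de-duplicated sequences described for Figure 8a, and GROUPED to the four alternatives starting with SR sub-rules 1, 2, 4 and 7 described for Figure 8b.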

At step S8, the modified data structure is input
into the modified data structure store 102. 

Details of the processing steps involved in forming 
the modified multimodal data structure will now be 
described with reference to the flow diagram of Figure 9. 

In step S10, modified data structure store 102 is 
accessed and the modified data structure is input into
the modified multimodal data structure former 103. 

In step S11, the multimodal data structure database
40 is accessed and the multimodal data structure is also
input into the modified multimodal data structure former 
103. 

In step S12, the modified multimodal data structure 
former 103 determines a set of modality interdependency 
rules by analysing the relationship specified between the 
two modalities in the multimodal data structure. For 
example, in the case of the fax rule described above, the 
modified multimodal data structure former 103 determines 
that one mouse click event is required in association 
with each of the two words "this" and "him". 

As mentioned earlier, under the bounded
relationship, the timing of the mouse click associated
with a given word is specified to be at any time after
the end of the word preceding the given word and before
the start of the word following the given word.
Referring now to the pause positions employed in forming
the modified speech recogniser data structure, it is to
be appreciated that the timing definition of the mouse
clicks results in those mouse clicks also being allowed
to take place during the pauses associated with the
words, as shown schematically in Figure 10, where mouse
click configurations 171 to 179 show the mouse click
timing relationship as applied to marker configurations
141 to 149 respectively.
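
The bounded relationship can be expressed as a simple time window test, sketched below. The TimedWord type, the function names and the example times are assumptions made for illustration. The treatment of the first and last words (no lower or upper bound respectively) and the use of strict inequalities are also assumptions, since the text defines the window only in terms of the preceding and following words.

```python
import math
from typing import NamedTuple


class TimedWord(NamedTuple):
    text: str
    start: float  # seconds
    end: float    # seconds


def click_window(words, index):
    """Window in which a click associated with words[index] may occur."""
    lower = words[index - 1].end if index > 0 else -math.inf
    upper = words[index + 1].start if index + 1 < len(words) else math.inf
    return lower, upper


def click_is_bounded(words, index, click_time):
    """True if the click falls after the preceding word and before the following word."""
    lower, upper = click_window(words, index)
    return lower < click_time < upper


# Example utterance with a pause after 'this' (times are illustrative only).
UTTERANCE = [
    TimedWord("fax", 0.0, 0.3),
    TimedWord("this", 0.4, 0.7),
    TimedWord("to", 1.5, 1.7),
    TimedWord("him", 1.8, 2.1),
]
```

With these example times, a click at 1.0 s falls in the pause following 'this' and is accepted for that word, whereas a click at 1.6 s, made whilst 'to' is being spoken, is not.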

Referring again to Figure 9, at step S13, the 
modified multimodal data structure former 103 associates 
the second modality events i.e. mouse clicks as specified 
by the interdependency rules with the SR grammar 
sub-rules, thereby forming a multi-modal sub-rule in 
correspondence with each SR sub-rule. 

In the case of the fax rule, for example, the first 
sub-rule is initially selected, i.e. SR sub-rule 1. At 
step S13 it is determined whether the modality
interdependency rules define any association of a mouse
click event with the words 'fax this to him' of SR
sub-rule 1. Referring to Figure 4 it can be seen that
two separate mouse clicks are indeed required with this
SR grammar sub-rule, hence two mouse clicks are
associated with selected SR sub-rule 1, as shown by item
191 of Figure 11. Next, SR sub-rule 2 is selected. SR
sub-rule 2 contains the words 'fax this to', hence
referring to Figure 4 it can be seen that only one mouse
click, corresponding to the word 'this', is specified to
take place in association with SR sub-rule 2.
Consequently at step S13 only one mouse click is
associated with SR sub-rule 2, as shown by item 192 of
Figure 11. The process is repeated for each grammar
sub-rule from SR sub-rule 1 to SR sub-rule 9. SR
sub-rule 3 contains only the word 'him', for which one
mouse click only is required, hence providing the
association shown as item 193 in Figure 11. SR sub-rule 4
contains the words 'fax this', hence one mouse click is
required due to the word 'this', resulting in association
with one mouse click as shown by item 194 in Figure 11.
Similarly, SR sub-rule 5 results in association with one
click as shown by item 195 in Figure 11, this being
derived from the word 'him'. However, SR sub-rule 6
contains only the word 'to' which does not have any mouse
click specified therewith (see Figure 4). Consequently
as shown by item 196 of Figure 11, no association with a
mouse click is allocated by the modified multimodal data
structure former 103 to SR sub-rule 6. The only word
contained by SR sub-rule 7 is the word 'fax' which also
has no mouse clicks associated therewith. SR sub-rule 8
contains the words 'this to him', hence it has two mouse
clicks associated therewith. SR sub-rule 9 contains the
words 'this to', hence the mouse click associated with
the word 'this' is associated with SR sub-rule 9. The
above process is repeated for each rule of the modified
sub-rule data structure.
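
The association performed at step S13 amounts to counting, for each SR sub-rule, the mouse clicks required by the words it contains. A minimal sketch is given below; the names CLICKS_PER_WORD, SUB_RULES and required_clicks are hypothetical, and the per-word click requirement is taken from the fax rule as described above.

```python
# Clicks required per word of the fax rule: one for 'this' and one for 'him'.
CLICKS_PER_WORD = {"this": 1, "him": 1}

# SR sub-rules 1 to 9 in the order used in the description.
SUB_RULES = [
    "fax this to him", "fax this to", "him",
    "fax this", "to him", "to",
    "fax", "this to him", "this to",
]


def required_clicks(sub_rule: str) -> int:
    """Number of mouse clicks to be associated with a sub-rule."""
    return sum(CLICKS_PER_WORD.get(word, 0) for word in sub_rule.split())


# Maps sub-rule number to required click count.
ASSOCIATIONS = {
    number: required_clicks(rule)
    for number, rule in enumerate(SUB_RULES, start=1)
}
```

The resulting counts agree with the associations described for items 191 to 196 of Figure 11 and for SR sub-rules 7 to 9.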

In step S14, the modified multimodal data structure
former 103 forms a modified multimodal data structure
comprising the SR sub-rules and the second modality
(i.e. mouse click) event associations derived for all the
rules. At step S15, the modified multimodal data
structure is input to the modified multimodal data
structure store 104. The form of the modified multimodal
data structure is illustrated in Figure 12.

Thus the data structure preprocessor 10 generates
both a modified data structure holding modified grammar
rules for use by the speech recogniser 200 and also a
modified multimodal data structure for use in the
analysis of the multimodal input by the comparator 210 in
order to generate an input for the processing application
30.

Details of the processing steps carried out by the
comparator 42 during the operation of the input processor
2 will now be described with reference to the flow
diagram of Figures 13a and 13b.

In step S20, the grammar sub-rules in the modified 
data structure are loaded into the speech recogniser 
(SR) 200. In step S21 a grammar rule counter n is set to 
1. The grammar rule counter determines the position of 
a sub grammar rule in a sequence for computing the 
modified SR grammar rule. Figure 8b shows the modified SR
fax rule organised into four alternatives where each
alternative has a unique SR grammar sub-rule at its
logical start (n=1). In the case of the first
alternative, for recognition of the whole modified SR fax 
rule to take place, <SR sub-rule 1> needs to be 
recognised. In the case of the second alternative, <SR 
sub-rule 2> needs to be recognised following which <SR 
sub-rule 3> needs to be recognised. This is represented 
by the second line in the equation form of Figure 8b. In 
the case of the third alternative, <SR sub-rule 4> needs 
to be recognised followed by the indicated variations 
with respect to <SR sub-rule 5>, <SR sub-rule 6> and <SR 
sub-rule 3> being recognised. This is represented by 
the third line of the equation form of Figure 8b. In the 
case of the fourth and last alternative of the present 
example, <SR sub-rule 7> needs to be recognised, followed
by the indicated combinations of <SR sub-rule 8>, <SR
sub-rule 9> and <SR sub-rule 3> being recognised. This
is represented by the fourth and final line of the
equation form of Figure 8b. Thus in the example shown in
Figure 8b, there are four first SR grammar sub-rules
(n=1), namely SR sub-rules 1, 2, 4 and 7. One of these
will be given by the SR as the best match in step S22. 

The procedure will now be described for the case 
when, for the modified SR fax rule, the best match given 
for a starting rule is SR sub-rule 2, containing the
words 'fax this to'. At step S23, comparator 201
determines any associated second modality events required 
for that SR grammar sub-rule. The comparator 201 
determines, from the modified multimodal data structure 
it has received, the requirement that for SR sub-rule 2 
one mouse click is required to have been input during the 
corresponding time portion of the audio input that 
provided the recognition result.

At step S24 the actual mouse click inputs by the 
operator using mouse 70 are analysed by the comparator 
201 to determine which, if any, of said events occurred 
in the time period corresponding to the relevant audio 
input. At step S25, the comparator 201 determines
whether the above described required mouse click events
are consistent with the actual received events. If they
are not consistent, then the comparator 201 allows a
time-out period to allow for input not yet properly
processed or, in the case of a mouse click due to appear
at the end of a pause, not yet received. After expiration
of the time-out period, at step S27 the comparator 201
once again carries out a determination step as to whether
the received and required mouse click inputs are
consistent. If they are still not consistent then at step
S28 it determines whether there are any further candidate
untried matches for the initial SR grammar sub-rules. If
there are none, in step S30 it is determined if the SR
grammar sub-rule being processed is the first in the
sequence and if so then at step S36 the output result is
set as "error". If the SR grammar sub-rule being
processed is not the first in the sequence, in step S31
a previous SR grammar sub-rule is tried and in step S28
it is once again determined if all matches have been
tried.

If at step S28 there are further best matches of SR
grammar sub-rules which have not been tried, then the
comparator 201 receives the next best match for an nth SR
grammar sub-rule from the speech recogniser 200 in step
S29, and the process returns to step S23.

In the above described procedure, if at step S25 or
step S27 the received and required second modality events
were in fact consistent, then at step S32 the nth SR
grammar sub-rule thus processed is stored as a currently
identified nth SR grammar sub-rule. Thus in the present
example, SR sub-rule 2 containing the words 'fax this to' 
is so identified. 

At step S33 the comparator 201 determines whether a 
modified SR grammar rule is completed with the identified 
nth SR grammar sub-rule. In the present example of the
modified SR fax rule, had SR sub-rule 1 been identified
as the correct starting SR grammar sub-rule then indeed
the overall modified SR fax rule would have been
satisfied. In this case the next step would have been
S40 in which a result is set as matched to the
corresponding SR grammar rule. However, in the present 
example SR sub-rule 2 does not fulfill the complete 
modified SR grammar rule, and hence the process moves to 
step S37 in which n is incremented. The process then 
returns to step S22 to receive the best match for the 
next SR grammar sub-rule in the sequence. 

In the present case where the initial SR grammar 
sub-rule (n=1) is SR sub-rule 2, if SR sub-rule 3 is the
best match for the next SR grammar sub-rule (n=2), at
step S23, the comparator 201 determines from the modified
multimodal data structure whether any mouse click events
are required for SR sub-rule 3. It will be determined
that one mouse click is required during the time period
of the audio input that has been recognised as SR
sub-rule 3.

At step S24, the comparator 201 determines whether 
any such mouse click event was indeed received from the 
mouse 70 during the appropriate time. At step S25, the 
comparator 201 determines whether the received and 
required versions are consistent. If not, then the 
comparator 201 allows a time-out period to receive any 
outstanding mouse click events to be processed or
received, and thereafter at step S26 determines whether
any such events have been received during the time-out
period. At step S27 the comparator 201 determines
whether the updated version of the received results is
now consistent with the required results. If not, then
at step S28 it is determined whether there are further
untried matches for this next SR grammar sub-rule (n=2).
If so, in step S29 the next best match for this next SR
grammar sub-rule is received from the speech recogniser 200
and the process returns to step S23. Thereafter the
process is repeated as described above.

If at S28 there were instead no more untried matches
for this next SR grammar sub-rule (n=2) available from
the speech recogniser 200, then in terms of the overall
procedure the currently identified starting SR grammar
sub-rule is inadequate. Consequently the comparator 201
attempts to identify a more suitable preceding SR grammar
sub-rule by decrementing the counter n (step S31) and
returning to step S28, where it first determines whether
any untried matches for the preceding SR grammar sub-rule
are available. If they are, then the overall process is
repeated starting again at step S29, i.e. a next best
match for the preceding SR grammar sub-rule is received
and the process continues from there.

If at step S30 no more matches of initial SR grammar
sub-rules are available, in other words all combinations
of matches of initial SR grammar sub-rules and
consequential following SR grammar sub-rules have been
exhausted, then at step S36 the result is set as
'error'. This would mean that no satisfactory speech
recognition result has been achieved that is also
consistent with the received mouse click events.

Returning now to the processing of the best match
for the next SR grammar sub-rule (n=2), if at either of
steps S25 or S27 it is determined that the received and
required results are in fact consistent, then the next
step carried out is S32, in which the next SR grammar
sub-rule (n=2) whose recognition result has been
determined as consistent is stored as the currently
identified 2nd (nth) SR grammar sub-rule.

At step S33, it is determined whether the currently
identified initial SR grammar sub-rule followed by the
currently identified second SR grammar sub-rule together
form a completed modified SR grammar rule. In the
present example where the currently identified initial SR
grammar sub-rule is SR sub-rule 2 and SR sub-rule 3 has
since been identified as the next SR grammar sub-rule,
then the whole modified SR fax rule is indeed completed
since SR sub-rule 2 followed by SR sub-rule 3 represents
the second alternative shown in Figure 8b. If, however,
in another example SR sub-rule 4 was identified as the 
currently identified initial SR grammar sub-rule, and 
thereafter SR sub-rule 6 was identified as the currently
identified next SR grammar sub-rule, then, as can be seen
from the third alternative of Figure 8b, the result so
far is favourable, but nevertheless a further following
SR grammar sub-rule, namely SR sub-rule 3, is still
required to complete the modified SR fax rule consisting
of SR sub-rule 4 followed by SR sub-rule 6 followed by
SR sub-rule 3. In this case the process would proceed to
step S37 to increment n and then return to step S22.
Then, so long as speech recogniser 200 provides a
recognition result for SR sub-rule 3 as the best match of
a third SR grammar sub-rule, the process will continue
again from S23 to verify the consistency of that
recognition result with the corresponding mouse click
event received.

After the comparator 201 determines, at step S33,
that a modified SR grammar rule has been completed, at 
step S34 it sets the result as matched to the 
corresponding modified SR grammar rule. 

The final step is that shown at S35, in which the 
result, be it a successful match via step S34 or an
"error" message via step S36, is used to generate an input
for the processing application 30. The input can be data 
provided by the multimodal input and/or instructions as 
interpreted in accordance with the modified multimodal 
grammar rules. 

It can thus be seen that this specific embodiment 
provides for improved speech recognition when using a 
multimodal input since the second modality can be used to 
improve the speech recognition result by predicting when 
pauses may be inserted in the first modality input whilst 
inputting data using the second modality. The search of 
the possible matching sub-grammars is achieved as 
illustrated in Figures 13a and 13b using a process to 
search the "tree structure" of the sub-grammar rules. 
Branches from initial sub-grammar rules to next
sub-grammar rules can be explored and if not successful, a
new initial sub-grammar rule can be tried together with the
branches therefrom to next sub-grammar rules.
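
The search of the tree structure can be sketched as a depth-first search with backtracking. The sketch below is a simplification made under stated assumptions: the recogniser is replaced by a hypothetical callable returning ranked pairs of sub-rule label and number of clicks received during the corresponding audio portion, the time-out handling of steps S26 and S27 is omitted, and consistency is reduced to comparing click counts. The names GRAMMAR, REQUIRED_CLICKS, search and scripted_recogniser are illustrative only.

```python
# Alternatives of the modified SR fax rule as sequences of sub-rule labels.
GRAMMAR = [(1,), (2, 3), (4, 5), (4, 6, 3), (7, 8), (7, 9, 3)]

# Mouse clicks required with each sub-rule.
REQUIRED_CLICKS = {1: 2, 2: 1, 3: 1, 4: 1, 5: 1, 6: 0, 7: 0, 8: 2, 9: 1}


def search(recognise, prefix=()):
    """Depth-first search over the sub-rule tree.

    `recognise(prefix)` is a hypothetical stand-in for the speech
    recogniser: it returns ranked (sub_rule, clicks_seen) pairs for the
    next position. Returns the completed sequence, or None for "error".
    """
    for sub_rule, clicks_seen in recognise(prefix):   # best match first
        candidate = prefix + (sub_rule,)
        if not any(alt[:len(candidate)] == candidate for alt in GRAMMAR):
            continue                                  # not allowed at this position
        if clicks_seen != REQUIRED_CLICKS[sub_rule]:
            continue                                  # inconsistent with the clicks received
        if candidate in GRAMMAR:
            return candidate                          # modified rule completed
        completed = search(recognise, candidate)      # move on to position n + 1
        if completed is not None:
            return completed
        # otherwise fall through: backtrack and try the next best match
    return None


def scripted_recogniser(prefix):
    """Toy recogniser output for one utterance (illustrative data only)."""
    ranked = {
        (): [(1, 1), (2, 1)],   # whole phrase with one click, then 'fax this to'
        (2,): [(3, 1)],         # 'him' with one click
    }
    return ranked.get(prefix, [])


RESULT = search(scripted_recogniser)
```

In this example the best initial match, SR sub-rule 1, is rejected because only one click was received, and the search continues with SR sub-rule 2 followed by SR sub-rule 3, which completes the second alternative.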

In the above described detailed embodiment, the 
pause criteria consisted of the simple general rule that 
a single pause is accommodated either directly before or
directly after a word or words provided one or more mouse
clicks are associated in the bounded relation to the word
or words. Even when maintaining such a simple form of
pause criteria specification, a number of variations are
possible. One detail in the above embodiment was that it
was not possible for a pause to occur both directly
before the word and directly after the word. However, in
an alternative embodiment, the pause may indeed occur 
both directly before and directly after the word. 
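
The effect of these two forms of pause criteria on the number of marker configurations can be sketched as follows. The representation of a configuration as a set of gap indices, the function name pause_gap_sets and the decision to ignore a pause before the first word are assumptions made for this illustration.

```python
from itertools import product


def pause_gap_sets(words, clicked_words, allow_both=False):
    """Return the distinct sets of gap indices at which a pause is marked.

    Gap i lies directly after words[i]. A pause before the first word is
    ignored, and each clicked word is assumed to occur once in the rule.
    """
    options = ["none", "before", "after"]
    if allow_both:
        options.append("both")
    configs = []
    for choice in product(options, repeat=len(clicked_words)):
        gaps = set()
        for word, where in zip(clicked_words, choice):
            i = words.index(word)
            if where in ("before", "both"):
                gaps.add(i - 1)
            if where in ("after", "both"):
                gaps.add(i)
        gaps.discard(-1)
        frozen = frozenset(gaps)
        if frozen not in configs:
            configs.append(frozen)
    return configs


FAX_WORDS = ["fax", "this", "to", "him"]
SINGLE_PAUSE = pause_gap_sets(FAX_WORDS, ["this", "him"])
BOTH_ALLOWED = pause_gap_sets(FAX_WORDS, ["this", "him"], allow_both=True)
```

For the fax rule the first form yields nine configurations, matching marker configurations 141 to 149, whilst allowing a pause both before and after each clicked word yields sixteen.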

In the above embodiment the same pause criteria are
applied to all of the rules of the application multimodal
data structure. In other embodiments different pause
rules can be set for different rules of the multimodal
data structure. Different pause rules could be ascribed
based on the classification of the type of rule in the
multimodal data structure.

All of the above alternatives broadly speaking 
represent a type of automatic pause criteria 
specification in which the required operations are
predetermined. In other embodiments, in addition to or
instead of such automatic pause criteria, customised
pause criteria can be input by an operator as required.
Such input by an operator may be made available either on
a rule by rule basis, or as a customised input that is
applicable to all of the rules. Certain pause criteria
can be input based on a response of a user to a query,
where the query to the user is in a format understood
by the user and which does not specifically detail
anything about the pause criteria. For
example the application may present a number of queries
such as 'do you wish speech processing to allow long
pauses when making mouse clicks?'. Alternatively,
combinations of automated and customised pause criteria 
can be formulated by the processor using algorithms 
employing historical data taken whilst monitoring a 
user's use of such a system and adapting the pause 
criteria to the particular traits of a particular user. 
Such trait matching could also be achieved by a profile 
input by a user. 

In the above embodiment, individual words of the 
natural language, i.e. English, form the basis of the 
pause criteria in view of the use of mouse clicks to
identify details related to the spoken words. In other 
embodiments, key types of word or data blocks that 
generally are associated with mouse click events could be
used. Furthermore, when the second modality is a
modality other than mouse clicks, this may in itself lead
to particular types of grammar structure or units being
the likely causes of pauses.

In the above embodiment a bounded relationship was 
employed to define an association between a multimodal 
event and spoken word. In other embodiments, different 
definitions can be specified. One possibility is that
the mouse click event must occur whilst the word is
actually being spoken.

In the above embodiment the entire modified data 
structure is entered into the speech recogniser from the 
modified data structure store 102 prior to the speech 
recogniser 200 processing speech input. Alternatively a
standard Speech Application Programmer's Interface (SAPI)
is used, enabling just a portion of the modified data
structure to be transferred initially to the speech
recogniser 200; then, based on feedback results from the
speech recogniser 200 to a processor controlling the
modified data structure store 102, further parts of the
modified data structure are transferred to the speech
recogniser 200 as required. In the latter embodiment,
for example, for the case of the SR fax rule only the
starting SR grammar sub-rules, i.e. sub-rules 1, 2, 4 and
7, are transferred initially to the speech recogniser.
Then, depending upon the progress of the procedure shown
in Figures 13a and 13b, particular following SR grammar 
sub-rules are transferred to the speech recogniser 200 as 
required. A further alternative is that all the modified 
data structure is transferred to the speech recogniser 
200 initially, but only some of the SR grammar sub-rules 
are initially activated within the speech recogniser. 
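
The selection of which SR grammar sub-rules to transfer or activate at each stage can be sketched as below. The function active_sub_rules and the tuple representation of the alternatives are assumptions made for illustration; no actual SAPI call is shown.

```python
# Alternatives of the modified SR fax rule as sequences of sub-rule labels.
GRAMMAR = [(1,), (2, 3), (4, 5), (4, 6, 3), (7, 8), (7, 9, 3)]


def active_sub_rules(grammar, prefix=()):
    """Sub-rules that may follow the already identified prefix."""
    n = len(prefix)
    return sorted(
        {alt[n] for alt in grammar if alt[:n] == prefix and len(alt) > n}
    )


STARTING = active_sub_rules(GRAMMAR)            # transferred or activated initially
AFTER_4 = active_sub_rules(GRAMMAR, (4,))       # needed once sub-rule 4 is identified
AFTER_4_6 = active_sub_rules(GRAMMAR, (4, 6))   # needed once sub-rule 6 follows
```

Here STARTING contains SR sub-rules 1, 2, 4 and 7, and the following sub-rules are determined only once a starting sub-rule has been identified.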

In the above described specific embodiment, timing 
details of the words recognised by the speech recogniser
200, more particularly a start time and an end time, are
used in conjunction with the time record of the mouse 
click events to determine whether the required 
association had occurred. In another embodiment, instead 
of an absolute time basis, the association can be based 
merely on the required sequential number of second 
modality events occurring e.g. 

speech input: fax this to him 

mouse input: click click 

In other embodiments the processing operations are 
implemented in systems other than the computer 
arrangement described. For example, the speech
recogniser can be a completely separate entity from other
processing units. Similarly, the data structure
preprocessor 1 may be arranged in a different module to
the input processor 2. Indeed, any of the above
described functions can be implemented in a suitable type
of processing arrangement, including distributed
arrangements.

In other embodiments, modalities other than
mouse clicks can form the second modality. Possibilities
include keyboard input, gestures, for example via
suitable video camera inputs, in particular pointing,
touch screen inputs, and so on. Also, more than one
modality other than speech can be accommodated. For
example, in addition to the first modality of speech, a 
second modality consisting of mouse click events and a 
third modality consisting of gestures can be included. 
Also, two different channels of a similar type of 
modality can be accommodated, for example right mouse 
button clicks and left mouse button clicks. 

The first modality need not be speech, rather it is 
merely limited to being any input in the form of a 
natural language which is to be recognised using grammar 
rules and for which the temporal relationship of tokens 
e.g. words is important. For example, sign language as 
used by deaf people could form the first modality in 
another embodiment of the present invention. 

The above embodiments provide improvements in 
accommodating pauses that arise in a natural language 
input due to the interaction with further modalities
other than that conveying the natural language input.
Fundamentally, the further modalities impose restrictions 
on the proper flow of the natural language input and can 
affect the recognition result for the natural language 
input modality. 

In its broadest aspects the present invention is not 
limited to multimodal inputs. The present invention is 
also applicable to a single modal input comprising a
natural language input which is to be recognised using
grammar rules and for which the temporal relationships
between tokens e.g. words are important. A primary example
of such an input is speech although another example is 
sign language. 

For such an input, a user may insert pauses 
inadvertently either when emphasising something or due to 
a particular style of input e.g. particular style of 
speech. The present invention is capable of compensating 
for this by generating modified grammar rules in the form 
of a modified data structure for use by the input 
recognition engine.

An embodiment receiving a single modal input will 
now be described with reference to Figures 14 and 15. 
This embodiment is able to generate modified speech 
recognition grammar rules to take into account pauses 
inserted by a user. In a conventional speech recognition
engine, when recognition is carried out on speech in
which pauses occur other than at the end of a sentence, 
incorrect recognition can result. 

Referring now to Figure 14, a data structure store
120 is provided to store a data structure defining a
speech recognition grammar. A pause criteria store 130
is provided to store criteria for defining where pauses
can occur in the grammar rules.

A data structure preprocessor 110 is provided for
reading SR grammar rules from the data structure store
120 and pause criteria from the pause criteria store 130.
Using the pause criteria, the data structure can be
modified for use by a speech recognition engine which is
provided in an input processor 140. For speech
recognition engines which can accept tokens defining
silence, the modified data structure generated by the
data structure preprocessor 110 can simply comprise the
speech recognition grammar read from the data structure
store 120 with the pause or silence markers inserted
therein. Alternatively, for speech recognition engines
which do not recognise tokens identifying pauses or
silence, the data structure preprocessor 110 can carry
out a further grammar rule modification step of
fragmenting the grammar rules into speech recognition
grammar sub-rules as described hereinabove with regard to
the embodiment concerned with multimodal grammar rules.
Thus, the speech recognition engine within the input
processor 140 can be provided with speech recognition
grammar sub-rules. This will enable the speech
recognition engine to more accurately carry out
recognition on input speech. Thus the input processor
140 is able to more accurately generate an input for a
processing unit 150 to receive the result of the speech
recognition e.g. data and/or commands.

The method of operation of this embodiment of the
present invention will now be described with reference to 
the flow diagram of Figure 15. 

In step S50 the data structure defining grammar 
rules for generating an input to a processing unit is 
input from the data structure store 120. In step S51 a 
modified data structure defining fragmented or marked 
grammar rules is determined using the speech pause 
criteria read from the pause criteria store 130. In step
S52 the modified data structure is used by the speech
recognition engine to recognise input speech and in step
S53 the recognised words are input into the processing
unit 150 either as data or commands.
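
The choice made at step S51 between marked and fragmented grammar rules can be sketched as follows. The function name build_modified_rule, the "<sil>" token and the decision not to mark a pause after the final word are assumptions made for this illustration; an actual speech recognition engine would define its own silence token, if any.

```python
def build_modified_rule(words, pause_gaps, engine_accepts_silence,
                        silence_token="<sil>"):
    """Return a marked rule or fragmented sub-rules for one grammar rule.

    `pause_gaps` holds indices i such that a pause may follow words[i].
    A gap after the last word is ignored (assumption of this sketch).
    """
    pieces, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        if i in pause_gaps and i < len(words) - 1:
            pieces.append(current)
            current = []
    pieces.append(current)
    fragments = [" ".join(piece) for piece in pieces]
    if engine_accepts_silence:
        # One rule with silence markers inserted at the pause positions.
        return [f" {silence_token} ".join(fragments)]
    # Otherwise separate sub-rules, one per fragment.
    return fragments


MARKED = build_modified_rule(["fax", "this", "to", "him"], {1}, True)
FRAGMENTED = build_modified_rule(["fax", "this", "to", "him"], {1}, False)
```

For a pause after 'this', the marked form is the single rule 'fax this <sil> to him' and the fragmented form is the pair of sub-rules 'fax this' and 'to him'.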

Although the present invention has been described 
hereinabove with reference to specific embodiments, the 
present invention is not limited to these embodiments and
it will be apparent to a skilled person in the art that
modifications can be made without departing from the 
spirit and scope of the present invention. 



CLAIMS:

1. Data processing apparatus comprising:

receiving means for receiving a data structure
defining grammar rules for recognition of a natural
language input;

analysing means for analysing the data structure to
identify positions in the grammar rules at which pauses
can occur in the natural language input; and

generating means for generating a modified data 
structure defining modified grammar rules for recognition 
of a natural language input with pauses therein. 

2. Data processing apparatus according to claim 1, 
wherein said analysing means is adapted to identify the 
positions in accordance with pause criteria for the 
natural language input.

3. Data processing apparatus according to claim 1 or 
claim 2 wherein said generating means is adapted to add 
marker means to the identified positions in the grammar 
rules at which pauses can occur in the natural language 
input to generate the modified data structure. 

4. Data processing apparatus according to any preceding 
claim wherein said generating means is adapted to
fragment the grammar rules in accordance with said 
identified positions to generate sub grammar rules to 
form said modified data structure. 

5. Data processing apparatus according to claim 4, 
wherein said generating means is adapted to form a 
hierarchical structure using said sub grammar rules to
form said modified data structure.

6. Data processing apparatus according to any preceding 
claim wherein said receiving means is adapted to receive 
a data structure defining grammar rules for use in speech 
recognition of a natural language speech input, and said 
generating means is adapted to generate said modified 
data structure defining modified grammar rules for speech 
recognition of a natural language speech input with 
pauses therein. 

7. Data processing apparatus according to any preceding 
claim wherein said receiving means is adapted to receive 
said data structure defining grammar rules for 
recognition of a natural language input as a first 
modality input in conjunction with associated events in 
at least one further modality input, said data structure 
defining the association between events in each modality
input, and events in said first modality input comprising 
units in the natural language. 

8. Data processing apparatus according to claim 7,
wherein said analysing means is adapted to identify said
positions in the grammar rules based on events in at
least one said further modality input.



9. Data processing apparatus according to claim 7 or
claim 8, wherein said generating means is adapted to
generate a further modified data structure defining said
modified grammar rules and the relationships with events
in the or each further modality input.

10. A data processing method comprising the steps:

receiving a data structure defining grammar rules
for recognition of a natural language input;

analysing the data structure to identify positions
in the grammar rules at which pauses can occur in the
natural language input; and

generating a modified data structure defining
modified grammar rules for recognition of a natural
language input with pauses therein.

11. A data processing method according to claim 10,
wherein the analysing step includes identifying the 
positions in accordance with pause criteria for the 
natural language input. 

12. A data processing method according to claim 10 or 
claim 11, wherein the generating step includes adding 
marker means to said data structure to identify the 
positions in the grammar rules at which pauses can occur
in the natural language input.

13. A data processing method according to any one of 
claims 10 to 12 wherein the generating step includes 
fragmenting the grammar rules in accordance with said 
identified positions to generate sub grammar rules to 
form said modified data structure. 

14. A data processing method according to claim 13 
wherein the generating step includes forming a 
hierarchical structure using said sub grammar rules to 
form said modified data structure. 

15. A data processing method according to any one of 
claims 10 to 14, wherein the receiving step comprises 
receiving a data structure defining grammar rules for use
in speech recognition of a natural language speech input,
and the generating step comprises generating a modified 
data structure defining modified grammar rules for speech 
recognition of a natural language speech input with 
pauses therein. 

16. A data processing method according to any one of 
claims 10 to 15, wherein the receiving step comprises
receiving a data structure defining grammar rules for
recognition of a natural language input as a first 
modality input in conjunction with associated events in 
at least one further modality input, said data structure 
defining the association between events in each modality 
input, events in said first modality input comprising 
units in the natural language. 

17. A data processing method according to claim 16, 
wherein the analysing step comprises identifying said 
positions in the grammar rules based on events in at 
least one said further modality input. 

18. A data processing method according to claim 16 or 
claim 17, wherein the generating step includes generating 
a further modified data structure defining said modified
grammar rules and the relationships with events in the or
each further modality input. 

19. Apparatus for generating data in a computer usable 
form, the apparatus comprising: 

receiving means for receiving a natural language 
input with a number of pauses therein; and 

recognition means for recognising said natural
language input using the modified data structure
generated using the method of any one of claims 10 to 18
to generate data in computer usable form. 

20. Apparatus according to claim 19, wherein said
recognising means comprises speech recognition means for 
recognising a natural language speech input. 

21. A method of generating data in a computer usable 
form, the method comprising receiving a natural language 
input with a number of pauses therein; and 

recognising said natural language input using the 
modified data structure generated using the method of any 
one of claims 10 to 18 to generate data in computer 
usable form. 



22. A method according to claim 21, wherein the 
recognising step comprises speech recognition of a 
natural language speech input. 

23. Apparatus for generating data in a computer usable 
form, the apparatus comprising: 

first modality receiving means for receiving data
generated for a natural language input by the apparatus of
claim 19 or claim 20, said data comprising recognised
units of the natural language and comprising data of a
first modality input;

further modality receiving means for receiving data 
identifying events in at least one further modality 
input;

data structure receiving means for receiving a 
further modified data structure defining modified grammar 
rules and the relationships with events in the or each 
further modality, said further modified data structure 
having been generated using the method of claim 18; 

analysing means for analysing the first modality 
input data and the or each further modality input data to 
determine if they match with any said modified grammar 
rule and related events in the or each further modality; 
and

generating means for generating computer usable data
in dependence upon said analysis by said analysing means.

24. Apparatus according to claim 23, wherein said first 
modality receiving means is adapted to receive 
recognition data comprising an ordered list of likely 
natural language units to accompany the most likely
natural language unit for each natural language unit
recognised, and said analysing means is adapted to use
said ordered list when the most likely natural language
units do not result in a match with any modified grammar 
rule and related events in the or each further modality. 

25. Apparatus according to claim 23 or claim 24 wherein 
said first modality receiving means is adapted to receive 
speech recognition data. 

26. A method of generating data in a computer usable 
form, the method comprising: 

a first receiving step of receiving data generated 
for a natural language input by the method of claim 21 or 
claim 22, said data comprising recognised units of the 
natural language, and comprising data of a first modality 
input; 



a second receiving step of receiving data 
identifying events in at least one further modality 
input; 

a third receiving step of receiving a further 
modified data structure defining modified grammar rules 
and the relationship with events in the or each further 
modality, said further modified data structure having 
been generated during the method of claim 18; 

analysing the first modality input data and the or
each further modality input data to determine if they
match with any said modified grammar rule and related 
events in the or each further modality; and 

generating computer usable data in dependence upon 
said analysis. 

27. A method according to claim 26 wherein the first
receiving step comprises receiving recognition data 
comprising an ordered list of likely natural language 
units to accompany the most likely natural language unit 
for each natural language unit recognised, and the 
analysis step includes using the ordered list when the 
most likely natural language units do not result in a 
match with any modified grammar rule and related events 
in the or each further modality.

28. A method according to claim 26 or claim 27, wherein
the first receiving step receives speech recognition
data.

29. Processor implementable instructions for controlling
a processor to carry out the method of any one of claims
10 to 18, 21, 22 or 26 to 28.



30. A carrier medium for carrying the processor
implementable instructions according to claim 29.



ABSTRACT 



NATURAL LANGUAGE INPUT METHOD AND APPARATUS 



A system is disclosed for generating a modified data 
structure defining modified grammar rules for the 
recognition of a natural language input with pauses in 
which grammar rules for recognition of a natural language 
are analysed to identify positions
at which pauses can occur in the natural language. A
modified data structure is generated defining the 
modified grammar rules in dependence upon the analysis. 
The modified data structure is used to improve the 
accuracy of recognition of a natural language input with 
pauses. Where the natural language input is used in 
conjunction with a second modality input, the analysis is 
performed to identify the positions of pauses in 
dependence upon the second modality inputs. 